Abstract

The effects of policy sharing between agents in a multi-agent
dynamical system have not been studied extensively. I simulate a
system of agents optimizing the same task using reinforcement
learning, to study the effects of different population densities
and policy sharing. I demonstrate that sharing policies decreases
the time to reach asymptotic behavior and results in improved
asymptotic behavior.



I. INTRODUCTION 

Human society can be thought of as a system of many
interacting intelligent agents in which learning is key.
Learning completely independently is a very inefficient way
of acquiring knowledge, which is why people share information
with each other, allowing humans to better accomplish their
tasks and goals. Enhancements in task performance through
communication are observed in honeybee colonies [1] as well
as in bacterial colonies [2], which can also be modeled as
systems of agents in a reinforcement learning system. It is
clear that sharing information can improve the performance
of multi-agent learning systems.

Sharing information has been shown to speed up task
optimization in a reinforcement learning based multi-agent
simulation, though not to affect the asymptotic performance
of the learning task [3, 4]. This effect is not necessarily
general in multi-agent intelligent systems. There are
different types of information which can be shared between
agents; in this paper we consider only policy assimilation,
in which one agent absorbs the superior policy of another
agent.

In order to investigate the effects of sharing information
on the optimization time and the asymptotic behavior of the
system, I implement a robust learning algorithm so that
learning can keep up with the evolution of the environment.
I then investigate the effects of changing the population
density of the system, as well as the probability with which
the agents share information. We will see that the asymptotic
behavior of the system depends on information sharing.

Often the multi-agent reinforcement learning problem is
studied by applying methods from single-agent reinforcement
learning. We can use the presented results to understand how
single-agent reinforcement learning can be extended and
improved to multi-agent reinforcement learning through
sharing information. The results also display a new type of
effect on asymptotic behavior: sharing can in fact change the
asymptotic behavior of intelligent multi-agent systems.

A. Reinforcement Learning Overview 

I will briefly cover reinforcement learning, and go into
depth only on the primary algorithm used in the simulations.
Further information about the other algorithms mentioned, as
well as all other aspects of this overview, can be found in
Refs. [5, 6].

Reinforcement learning (RL) is a subfield of machine
learning concerned with finding the best set of actions
for an agent in an environment such that its long-term
reward (from the environment) is maximized. The agent
is the learner and decision maker. The environment is
what the agent interacts with; it is everything outside of
the agent's immediate control. Each time the agent takes
an action, it is presented with a new situation by the
environment. We call this situation the agent's state.
Through trial and error, the agent gradually discovers the
best set of actions to take in certain states.

The agent and the system interact in a sequence of discrete
time steps. At each time step $t$, the agent receives a
representation of the environment's state $s_t \in S$, where
$S$ is the set of all possible states. Do not confuse the
entire system with the environment: the environment is
merely the agent's local observation. The agent then takes
an action $a_t \in A(s_t)$, $A(s_t)$ being the set of all
possible actions in state $s_t$. At the next time step
$t + 1$, the agent receives a reward $r_{t+1}$ from the
environment, as well as its new state $s_{t+1}$. See figure 1.

The agent's decisions are governed by its policy, which
maps states to actions:

$$\pi : S \to A(s)$$

The agent aims to find the policy which maximizes its
rewards.



B. Key Aspects of Reinforcement Learning

The algorithms in RL are designed with the assumption that
the system composed of the agent and its environment is
Markovian, or that it has the Markov Property: the next state
and reward depend only on the current state and action. In
principle, there is no reason why the state signal cannot
contain other types of information (for instance, memory).
In RL, the primary focus is on the decision making process,
not on designing the state signal. Accordingly, we want the
state signal to be compact, which is why we have it reflect
immediate sensory information. Even if a system is not
strictly Markovian, an approximation to a Markovian system
is good enough for the RL algorithms to work properly.

FIG. 1: An RL flowchart [5]

The policy $\pi(s, a)$ is a set of probability distributions
specifying the probability that the agent will take a certain
action $a$ given its state $s$. This probability is given as

$$\pi(s, a) = P\{a_t = a \mid s_t = s\},$$

and

$$\sum_{a'} \pi(s, a') = 1.$$

The return at step $t$, $R_t$, is defined as

$$R_t = \sum_{k=0}^{T} \gamma^k r_{t+k+1},$$

where $T$ is the episode duration (the number of time steps
the system takes to reach a terminal state, meaning when
the system has achieved an intermediate or final goal)
and $\gamma \in [0, 1]$ is the discount parameter. The
discount parameter specifies how strongly rewards received
later in an episode are weighted relative to immediate ones.
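
As a concrete illustration of the return, the following minimal Python sketch sums a finite list of rewards with discounting. The function name `discounted_return` and the sample rewards are illustrative assumptions and are not taken from the simulation code.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k]; rewards[k] plays the role of r_{t+k+1}."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Accumulate from the last reward backwards: R = r + gamma * R_next.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# gamma = 0 keeps only the immediate reward; gamma = 1 weights all rewards equally.
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With a discount of 0 the agent is fully short-sighted, and with a discount of 1 every reward in the episode counts equally.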

The value function for a policy $\pi$ is given by

$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\}.$$

The value function tells us our expected return given a
state. It tells us how good that state is: a state with
a higher value is preferable because we expect a higher
long-term reward. Similar to the value function, we define
the action-value function for a policy $\pi$ as

$$Q^\pi(s, a) = E_\pi\{R_t \mid s_t = s, a_t = a\}.$$

The action-value function tells us how good it is to take
a certain action in a certain state, using the same logic as
in the value function. It is easy to see that

$$V^\pi(s) = \sum_{a} \pi(s, a)\, Q^\pi(s, a).$$

Optimizing performance in a reinforcement learning system
corresponds to maximizing our value and action-value
functions. Accordingly, we aim to find the optimal value
function $V^*$ and the optimal action-value function $Q^*$,
given by

$$V^*(s) = \max_\pi V^\pi(s),$$

$$Q^*(s, a) = \max_\pi Q^\pi(s, a),$$

for all $s \in S$, $a \in A(s)$. Thus we need some method of
determining which policies are better than others. We
do this by comparing value functions:

$$\pi \geq \pi' \iff V^\pi(s) \geq V^{\pi'}(s)$$

for all $s \in S$. Following this, we can define the optimal
policy $\pi^*$ as one satisfying

$$\pi^* \geq \pi$$

for all $\pi$. Note that $\pi^*$ is not necessarily unique.

C. Policy Improvement 

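The method involved in finding the optimal policy is
fairly straightforward and applies generally to different
methods of optimization. The steps are:

$$\pi_0 \xrightarrow{E} V^{\pi_0}, Q^{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} \cdots \xrightarrow{I} \pi^* \xrightarrow{E} V^*, Q^*,$$

with $\xrightarrow{E}$ and $\xrightarrow{I}$ denoting policy
evaluation and policy improvement, respectively. The methods
of policy improvement relevant here are $\epsilon$-greedy,

$$\pi(s, a) = \begin{cases} 1 - \epsilon + \epsilon/|A(s)| & \text{if } a = \arg\max_{a'} Q(s, a') \\ \epsilon/|A(s)| & \text{if } a \neq \arg\max_{a'} Q(s, a') \end{cases}$$

and softmax,

$$\pi(s, a) = \frac{e^{Q(s, a)/\tau}}{\sum_{a'} e^{Q(s, a')/\tau}},$$

where the temperature parameter $\tau$ specifies the
randomness of decisions.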
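
The two improvement rules can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names are hypothetical, the action values for one state are passed as a plain list, and ties in the greedy choice go to the lowest index (the text does not say how ties are broken).

```python
import math


def epsilon_greedy_probs(q_values, epsilon):
    """Greedy action gets 1 - eps + eps/|A|; every other action gets eps/|A|."""
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])  # ties: lowest index (assumption)
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs


def softmax_probs(q_values, tau):
    """Action probabilities proportional to exp(Q(s, a) / tau)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    q_max = max(q_values)  # subtracting the maximum avoids overflow; ratios are unchanged
    weights = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


q = [0.0, 2.0, 1.0, -1.0]            # made-up action values for one state
print(epsilon_greedy_probs(q, 0.2))  # most of the probability on action 1
print(softmax_probs(q, 0.5))         # smoother preference for action 1
```

A large temperature makes the softmax probabilities nearly uniform, whereas a small temperature concentrates them on the greedy action.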



D. Q-Learning 

In Q-Learning (QL) we approximate the action-value
function according to

$$Q(s_t, a_t) := Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right], \qquad (1)$$

with $\alpha$ being the learning rate and $\gamma$ (as before)
the discount parameter [7]. Just like the discount parameter,
we want $\alpha \in [0, 1]$. In order for a policy evaluated
by QL to converge over discrete time steps, it is proven in
Ref. [8] that the learning rates must satisfy

$$\sum_{k=1}^{\infty} \alpha_k = \infty, \qquad \sum_{k=1}^{\infty} \alpha_k^2 < \infty. \qquad (2)$$

In order for the criterion in equation (2) to be met, we
define

$$\alpha_k(s, a) = \frac{1}{k(s, a)},$$

where $k(s, a)$ is the number of times the state-action pair
$(s, a)$ has been visited. I use this method to ensure the
accuracy of the approximation of the value and action-value
functions, rather than searching for the most accurate
constant value of $\alpha$. For an approach to the latter
method, see Ref. [9]. An implementation of QL can be found
in algorithm 1.



Algorithm 1 Q-Learning
while True do
    $a_t := \pi(s_t)$
    Observe $r_{t+1}$, $s_{t+1}$
    $\alpha(s_t, a_t) := \alpha(s_t, a_t) / (1 + \alpha(s_t, a_t))$
    $Q(s_t, a_t) := Q(s_t, a_t) + \alpha(s_t, a_t) \{ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \}$
    Iteration := Iteration + 1
    if Iteration > Max Iteration or Terminal State then
        Update $\pi$ ($\epsilon$-greedy or softmax)
        Iteration := 0
    end if
end while
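
The update step of the algorithm can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code in `sim.py`: the class name `QLearner` is hypothetical, action values start at zero, and the learning rate is taken as one over the number of visits counting the current one, since the starting value of the learning rate is not given in the text.

```python
from collections import defaultdict


class QLearner:
    """Tabular Q-learning with a per-pair learning rate of 1 / (visit count)."""

    def __init__(self, n_actions, gamma=0.9):
        self.n_actions = n_actions
        self.gamma = gamma              # 0.9 is the discount quoted for the simulations
        self.q = defaultdict(float)     # Q(s, a), zero-initialized (assumption)
        self.visits = defaultdict(int)  # number of visits to each (s, a)

    def update(self, s, a, reward, s_next):
        """Apply one Q-learning update and return the learning rate that was used."""
        self.visits[(s, a)] += 1
        alpha = 1.0 / self.visits[(s, a)]
        best_next = max(self.q[(s_next, b)] for b in range(self.n_actions))
        td_error = reward + self.gamma * best_next - self.q[(s, a)]
        self.q[(s, a)] += alpha * td_error
        return alpha


learner = QLearner(n_actions=4)
learner.update(s=0, a=1, reward=1.0, s_next=2)
learner.update(s=0, a=1, reward=3.0, s_next=2)
print(learner.q[(0, 1)])  # running mean of the two targets
```

With this learning rate each action value is the running average of the targets seen so far, which is why the rate shrinks as a pair is visited more often.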



II. THE SYSTEM 

The system I investigate is an enclosed two-dimensional
square arena containing circular bot-agents. The density
$\rho$ corresponds to the total area of the bots divided by
the total area of the arena. Each bot has sensors covering
$N$ equal wedges of its circumference. The sensors are
numbered $0, 1, \ldots, N - 1$. In the simulation, we fix
$N = 4$. If there is some object touching the agent at some
angle $\phi$ from the agent's orientation, the $n$th sensor
is activated, $n$ given by

$$n = \left\lfloor \frac{N\phi}{2\pi} \right\rfloor.$$

For each state, all of the other $N - 1$ sensors can be on
or off. Thus, we can define our state $s$ as

$$s = \sum_{n=0}^{N-1} \mu_n 2^n,$$

with $\mu_n = 1$ if the $n$th sensor is activated, and
$\mu_n = 0$ if it is not. According to this representation,
there will be $2^N$ possible states:

$$S = \{0, 1, \ldots, 2^N - 1\}.$$

An illustration can be found in figure 2. The bots are
smooth, hard disks, which collide with the walls and
other bots completely inelastically. The simulation is
initialized by placing all of the agents in the arena, all at
different random initial orientations $\theta_{0,i}$, such
that none overlap.
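
The state encoding can be illustrated with a short Python sketch. It assumes that the sensor index is the floor of N times the contact angle divided by 2π, which is how I read the equal-wedge layout, and that angles are measured in radians from the agent's orientation; the function names are hypothetical.

```python
import math

N_SENSORS = 4  # the simulations fix N = 4


def sensor_index(phi, n_sensors=N_SENSORS):
    """Index of the wedge containing contact angle phi (radians from the orientation)."""
    phi = phi % (2.0 * math.pi)  # wrap into [0, 2*pi)
    # The final modulo guards against phi rounding up to exactly 2*pi.
    return int(n_sensors * phi / (2.0 * math.pi)) % n_sensors


def encode_state(active_sensors):
    """State s: the sum of 2**n over the activated sensors n."""
    state = 0
    for n in set(active_sensors):
        state |= 1 << n
    return state


def decode_state(state, n_sensors=N_SENSORS):
    """List of activated sensor indices for a given state."""
    return [n for n in range(n_sensors) if state & (1 << n)]


print(encode_state([2]))  # 4: only sensor 2 is active
print(encode_state([0]))  # 1: only sensor 0 is active
print(decode_state(5))    # [0, 2]
```

States 4 and 1 are the two states quoted in the caption of figure 2; each corresponds to a single activated sensor.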




FIG. 2: An illustration of two touching agents. The one to 
the left is in state 4, whereas the other agent is in state 1. 
If one of these agents broadcasts its policy, the other would 
receive the broadcast. 

Each bot-agent has a fixed set of actions $A$ for all
$s \in S$: either rotate to one of its other fixed, evenly
spaced angles¹, or move forward in a straight line at a
universal constant speed along the orientation $\theta$ of
the bot-agent. Thus $|A(s)| = N$ for all $s \in S$. The next
learning iteration occurs when the agent reaches its
terminal orientation (it is done rotating towards its new
angle) or, if it moved forward, when it experiences a
collision. These occurrences are called events. After an
event occurs, the agent broadcasts its policy with
probability $p$ to all agents touching its circumference (as
in figure 2). Algorithm 2 displays how the policy sharing
works, with $\mathcal{B}$ being the set of bot-agents
touching bot-agent $A$ when $A$ broadcasts its policy. This
policy sharing algorithm is genetic in that it evaluates the
fitness of a policy based on its value function, and
proceeds with naive evolutionary selection.



Algorithm 2 Policy Sharing Algorithm
if Random < p then
    for $B \in \mathcal{B}$ do
        if $\pi^A > \pi^B$ then
            $V^B(s) := V^A(s) \; \forall s$
            $Q^B(s, a) := Q^A(s, a) \; \forall s, a$
            $\alpha^B := \alpha^A$
            $\pi^B := \pi^A$
        end if
    end for
end if

Do not confuse the traditional notion of state with that
used in RL: states are measured in discrete intervals during
each time step, or learning iteration. When the agent is
moving there is no moving state associated with it. The
agent's next state is determined by the readings from its
sensors after an event.

¹ If the bot-agent is oriented at angle $\theta$, this
corresponds to changing this angle to $\theta + 2\pi n/N$
for $n \in \{1, 2, \ldots, N - 1\}$.

[Figure panel title: Large Arena, $\rho = 0.0157$]

(a) Small arena: 150 × 150, $\rho = 0.279$. (b) Large arena: 200 × 200, $\rho = 0.0236$.

FIG. 3: Images of the small and large arenas used in the
simulations.

The simulations are run on two square arenas. If L is 
the length of a side of the arena and R is the bot-agent 
radius, the large arena has L/R = 20, whereas the small 
arena has L/R = 15. The large arena holds up to 50 
agents, and the small arena holds up to 25 agents. Some 
images of these different setups are found in figure 3. 
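
The densities quoted with figure 3 can be checked with a short calculation. In the sketch below, the side lengths 150 and 200 come from the figure, the radius of 10 follows from the stated ratios of side length to radius, and the agent counts of 20 and 3 are inferred here from the quoted densities rather than stated in the text; the function name is hypothetical.

```python
import math


def arena_density(n_agents, radius, side):
    """Fraction of a square arena's area covered by n_agents disks of the given radius."""
    return n_agents * math.pi * radius ** 2 / side ** 2


print(round(arena_density(20, 10, 150), 3))  # 0.279, the small-arena value in figure 3
print(round(arena_density(3, 10, 200), 4))   # 0.0236, the large-arena value in figure 3
```

Under the same assumptions, the maximum populations of 25 and 50 agents correspond to densities of roughly 0.35 and 0.39.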

Each agent's task is to travel the greatest distance. We
represent this task by specifying the reward the agent will
receive after each action:

$$r_{t+1} = -C + k D_{a_t},$$

where $C$ and $k$ are constants and $D_{a_t}$ is the distance
traveled as a result of action $a_t$. The $C$ parameter
discourages actions in which $D$ is small, and the $k$
parameter ensures that the rewards from traveling do not
drown out $C$. Good behavior corresponds to short convergence
times (quick learning) and a higher average speed (longer
distance per unit time).
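
To make the roles of the two reward parameters concrete, the sketch below evaluates the reward for a few distances. The parameter values are made up for illustration, since the values used in the simulations are not given, and treating a rotation as an action with zero distance is an assumption.

```python
def reward(distance, cost, gain):
    """Reward -C + k * D for an action that moved the agent by `distance`."""
    return -cost + gain * distance


def break_even_distance(cost, gain):
    """Distance at which the reward changes sign, C / k."""
    return cost / gain


C, K = 1.0, 0.5  # hypothetical values, not taken from the paper
print(reward(0.0, C, K))          # -1.0: an action that covers no distance is penalized
print(reward(4.0, C, K))          # 1.0: a longer run is rewarded
print(break_even_distance(C, K))  # 2.0
```

Only actions that cover more than the break-even distance yield a positive reward, which is how the constant term discourages short moves.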



A. Focusing on One Learning Algorithm 

I conduct the primary simulations using a single learning
algorithm. Though interesting behavior is not limited to the
best algorithm (the one which achieves the objective to the
greatest extent), I choose one to reduce the number of
variables. I chose the best-performing algorithm so that, as
the sharing and density parameters are varied, the behavior
of interest is not buried by an algorithm struggling to
optimize a policy. A typical comparison of the learning
algorithms considered is displayed in figure 4. The behavior
displayed is observed across different densities. It is
clear that the softmax QL algorithm is quick in adapting to
its environment. All simulations have $\gamma = 0.9$ and
$\tau = 0.5$.

FIG. 4: A comparison of learning algorithms for a low arena
density. The $\epsilon$ corresponds to using an
$\epsilon$-greedy policy, whereas the B corresponds to a
softmax policy. There is no policy sharing.

FIG. 5: The vertical red line specifies the point in time when
the system converged. The sharing probability is 1/2 in this
case.



III. RESULTS 

As we have already seen, the QL algorithm converges.
This convergence shows itself as the constant average
velocity per bot displayed in figure 4, as the agents'
policies become static. This is the criterion we will use to
determine when the system converges: the point at which the
average distance per bot becomes linear in time. That point
is found by fitting a line to the tail of the curve and
extrapolating the fit backwards in time. The point where the
fit line deviates from the data is the threshold at which
the system begins to display its asymptotic behavior. See
figure 5.
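
One way to implement this criterion is sketched below in plain Python. The fraction of the data treated as the tail and the tolerance that defines a deviation are both assumptions chosen for illustration; the text does not state the values used, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line through the points; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def convergence_time(times, distances, tail_fraction=0.25, tolerance=0.02):
    """Fit the tail, extrapolate backwards, and return (convergence time, tail slope)."""
    n_tail = max(2, int(len(times) * tail_fraction))
    slope, intercept = fit_line(times[-n_tail:], distances[-n_tail:])
    allowed = tolerance * (max(distances) - min(distances))
    # Walk backwards from the end while the previous sample still lies on the fit line.
    index = len(times) - 1
    while index > 0:
        predicted = slope * times[index - 1] + intercept
        if abs(distances[index - 1] - predicted) > allowed:
            break
        index -= 1
    return times[index], slope


# Synthetic data: the speed ramps up until t = 50 and is constant (1.0) afterwards.
times = list(range(200))
distances = [t * t / 100 if t < 50 else t - 25 for t in times]
t_conv, speed = convergence_time(times, distances)
print(t_conv, speed)
```

The slope of the tail fit doubles as the asymptotic average speed. A curve that runs nearly parallel to the fit line for a long time before settling can fool this kind of test, which is the artifact mentioned in the caption of figure 7.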



A. Hypotheses 

For lower arena densities, the system is less complicated
and, accordingly, the agents should experience similar
situations more often and optimize more quickly. We should
also see the effects of sharing more prominently at higher
densities, as there will be more interactions per bot per
time step. We can also expect that, in general, some sharing
should speed up convergence, whereas too much sharing might
slow it down: the system will act greedily in the beginning,
optimizing immediate rewards and impairing exploration. What
we cannot guess is where the fastest convergence lies along
the sharing probability space. We are also unsure of the
effect of the sharing probability $p$ on the asymptotic
behavior, though the sources mentioned earlier [3, 4] lead
us to believe that sharing does not affect the long-time
behavior.



B. Simulation Results and Discussion 

In figure 6 we see that for lower densities, sharing has
only a weak effect on the convergence time. We also observe
that, in general, lower densities converge more quickly. We
observe similar behavior in the smaller arena, but,
curiously, the smaller arena takes more time to converge.
See figure 8. Regardless, the general trends seem to apply
to both systems.



FIG. 6: Simulation results for the large arena. Notice that
the fastest convergence times correspond to the highest
sharing probabilities.

FIG. 7: A closer look at some sharing probabilities from
figure 6. Some anomalous spikes are due to artifacts in the
convergence time calculations (see figure 5), for instance
if the data ran closely parallel to the tail linear fit for
a long duration of time.

FIG. 8: Simulation results for the smaller arena. Notice the
similar behavior to that of the large arena in figure 6.

Since we are curious about asymptotic performance, we
investigate the asymptotic average velocity, as for example
in figure 5. In figure 9, there appears to be no difference
between low sharing probabilities and high sharing
probabilities. There does appear to be a difference between
no sharing at all (independent) and any amount of sharing.
For a closer look at this difference, figure 10 shows the
ratio of the asymptotic speeds with sharing to those
without. As the density increases, the asymptotic behavior
of any sharing system departs strongly from that of the
independent (non-sharing) system.

When each agent is finding its own locally optimal policy,
it eventually has to settle on a preference for how it
resolves certain situations, for instance whether to turn
left or right after a collision. These preferences become
approximately permanent after a policy is well established.
They cause any given agent in the collective to work against
the other agents, as it prefers (for instance) a
counterclockwise vortex to a clockwise one. When the agents
are forced to swap policies, the collective converges to a
single preference. This exact behavior is displayed in
figure 11. Make no mistake, the independent agents' policies
do optimize, but with independent reinforcement learning it
is too difficult for the agents to coordinate.



IV. CONCLUSIONS 

We see that the time until convergence depends on the
arena size, the arena density, and the sharing probability.
The smaller arena experienced slower convergence, the lower
densities experienced faster convergence, and the higher
sharing probabilities experienced faster convergence. The
sharing probability yielding the fastest convergence time
appears to be near 1, and not somewhere in between 0 and 1
as we expected.

Perhaps the most interesting result of this simulation is
that, in contrast to Refs. [3, 4], sharing policies
significantly affects the asymptotic behavior of the system,
especially for higher densities. As the system develops,
each agent develops its own preferences for resolving
different states. These preferences become permanent after
the agent has a well-established policy. If there is any
sharing ($p > 0$), then as the system runs for very long
periods of time, all of the agents adopt a uniform
preference and the collective becomes coordinated. The
specified task is performed better in the sharing case than
in the independent case. This improvement shows itself more
prominently at greater densities, and appears to be
independent of the arena size.

The genetic policy sharing used (outlined in algorithm 2)
is by no means the best algorithm. Variants might perform
better, such as averaging policies and value functions
rather than erasing the information an agent had
accumulated. Regardless, these results can be summarized
quite plainly: I have demonstrated that the asymptotic
behavior of an RL system can be improved through a policy
sharing mechanism.

FIG. 9: Simulation results for the large arena, displaying
the asymptotic performance of the collective (plot title:
Convergent Average Speeds; horizontal axis: Pshare). Notice
the slightly darker band for no sharing (to the left); the
rest of the behavior is uniform.

FIG. 10: The ratio of the sharing terminal speeds to the
independent terminal speeds as a function of the density
$\rho$ (plot title: Ratios of Convergent Average Velocity
per Bot).

FIG. 11: A low share probability $p$ and a high share
probability display the collective all adopting the same
preferences. With $p = 0$ we see that the average policies
level out, indicating that agents do not coordinate (plot
title: Average Convergent Policies, Large Arena,
$\rho = 0.157$; panel label: $p = 0.25$).



A. Future Work 

The fluctuations of the system are not discussed here.
Such aspects of the system could yield interesting behavior.
For example, one could ask whether the emergent cooperative
behavior of the system increases or decreases the
fluctuations of the performance measure (total distance
traveled per agent with respect to time).

One could also study the effects and robustness of other 
learning algorithms in the context of the system. In 
Ref. [10], Maozu Guo et al. find that a learning algo- 
rithm based on simulated annealing performs very well. 
It would be interesting to see if a simulated annealing 
learning algorithm performs as well as QL, and to see 
how the system behaves under varying parameters with 
respect to the two learning algorithms. Additional stud- 
ies of alternative algorithms include varying the sharing 
algorithm, and introducing an inhomogeneity of learn- 
ing algorithms. Furthermore, one can introduce an inho- 
mogeneity of agents by, for example, varying the agent 
radius. 

A mediating algorithm can be introduced to coordi- 
nate the agents. This algorithm can be based on RL or 
some other subfield of machine learning. The effects of 
this meta-algorithm might introduce some very interest- 
ing behavior. 

The nature of the system lends itself well to a spatial
diffusion simulation. This can be done by increasing the
arena size, and placing many agents in a dense group
at the arena center. In addition to studying spatial
diffusion, one can study the diffusion of knowledge among
the collective. One way this can be done is by placing a
master agent among dummy agents, and tracking how
the knowledge diffuses, and how that diffusion affects the
collective's task performance.

APPENDIX A: RUNNING THE SIMULATION

In order to run the simulation, execute the sim.py
file in the primary directory. Some help is included,
as well as a readme file. Contact me with any questions
regarding the simulation output or the underlying
processes. The simulation source code is available
at http://students.clarku.edu/~jellowitz/files/release.tar.gz.